Chapter 1

I had used R a little previously, so getting started was easy. I managed to install all required packages and knit Exercise Set 1.

Dealing with GitHub has been quite annoying because the website has been really slow to update after pushing.

R for Health Data Science is helpful for getting started, but on some topics I would prefer more information and examples. I found Exercise Set 1 a useful companion to the book.


Chapter 2

Exercise Set 2.

Data Wrangling

I created a “data” folder and added the R script “create_learning2014.R” to it. The script reads the learning2014 data from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt and creates an analysis dataset, which is saved in the data folder as learning2014.csv.
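The script itself is not shown here, but a minimal sketch of what it might look like is below. Note that `deep_questions`, `surf_questions`, and `stra_questions` are placeholders for the vectors of question names defined in the course materials.

```r
# sketch of create_learning2014.R -- question groupings are placeholders
library(dplyr)

# read the full survey data (tab-separated, with a header row)
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                    sep = "\t", header = TRUE)

# deep, surf and stra are averages over the related questions
# (deep_questions etc. stand in for the actual question name vectors)
lrn14$deep <- rowMeans(lrn14[, deep_questions])
lrn14$surf <- rowMeans(lrn14[, surf_questions])
lrn14$stra <- rowMeans(lrn14[, stra_questions])

# keep the analysis variables and exclude students with zero exam points
learning2014 <- lrn14 %>%
  select(gender, Age, Attitude, Points, deep, surf, stra) %>%
  rename(age = Age, attitude = Attitude, points = Points) %>%
  filter(points > 0)

# save the analysis dataset to the data folder
write.csv(learning2014, "data/learning2014.csv", row.names = FALSE)
```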

Analysis

First we read the learning2014 data. In the file, the header is given in the first row, and the data separator is “,”.

# read data
students2014 <- read.table("data/learning2014.csv", header = T, sep = ",")
# print dimensions of the table
dim(students2014)
## [1] 166   7
# print first 6 rows
head(students2014)
##   gender age attitude points     deep     surf  stra
## 1      F  53       37     25 3.583333 2.583333 3.375
## 2      M  55       31     12 2.916667 3.166667 2.750
## 3      F  49       25     24 3.500000 2.250000 3.625
## 4      M  53       35     10 3.500000 2.250000 3.125
## 5      M  49       37     22 3.666667 2.833333 3.625
## 6      F  38       38     21 4.750000 2.416667 3.625

The dataset has 166 observations (rows) and 7 variables. The variables are:

## [1] "gender"   "age"      "attitude" "points"   "deep"     "surf"     "stra"

Gender is M (male) or F (female), and age is given in years. Attitude describes the student’s overall attitude toward statistics. Points are the students’ exam results. deep, surf, and stra are averages taken over the students’ answers to questions on deep, surface and strategic learning.


Next, we can plot a graphical overview of the data using the GGally and ggplot2 libraries. The variables are plotted separately for each gender.

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(students2014[-1], mapping = aes(col = students2014$gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

The plots on the diagonal show the density distributions of each variable (red for female, blue for male). The scatter plots below the diagonal show how each pair of variables is related.

Above the diagonal are shown the total and gender-specific correlations between each pair of variables. The number of stars ’*’ next to each value tells how statistically significant the correlation is. The largest total correlation is between attitude and points. Interestingly, for males there is a strong negative correlation between deep and surf, while for females these two are essentially uncorrelated.


We see that attitude, stra, and surf have the largest correlations with points. Thus, we choose them as explanatory variables and fit a regression model with exam points as the target variable.

# create a regression model with multiple explanatory variables
m <- lm(points ~ attitude + stra + surf, data = students2014)

# print out a summary of the model
summary(m)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Looking at the summary, surf does not have a statistically significant relationship (i.e. its p-value is too large) with the target variable. We can remove it and refit the model.

m2 <- lm(points ~ attitude + stra, data = students2014)
summary(m2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

P-values for the remaining variables are now smaller, and the adjusted R-squared is slightly higher. There is a strong positive correlation between attitude and the target variable, but stra (strategic learning) also has a small effect.

Multiple R-squared tells how well the explanatory variables explain the variation of the target. In this case only 20.48 % of the variation in exam points is explained by attitude and stra, so the explanatory power of the model is quite low.


We can use the model to produce the following diagnostic plots: Residuals vs Fitted values, Normal Q-Q plot, and Residuals vs Leverage.

# Plot residuals vs fitted values
plot(m2, which = c(1))

The model assumes a linear regression. If the “Residuals vs Fitted” plot were curved, we might need to include a quadratic term in the model. Based on the plot, a linear model is sufficient. Points 56, 35, and 145 are the biggest outliers.

# Plot Q-Q residuals
plot(m2, which = c(2))

The “Q-Q Residuals” plot follows the straight line fairly closely apart from a few outliers. This shows that the prediction errors are close to normally distributed.

# Plot residuals vs Leverage
plot(m2, which = c(5))

Leverage is a measure of how much influence each point has on the model prediction. In “Residuals vs Leverage”, we see that all leverage values are small, but some points have a proportionally larger effect.


Chapter 3

Exercise Set 3.

Data Wrangling

I downloaded the data from http://www.archive.ics.uci.edu/dataset/320/student+performance, and added the R script create_alc.R in the folder. The script merges two datasets of student alcohol consumption, and saves the result in the data folder as alc_use.csv. We only keep students present in both datasets. There are 370 students and 35 variables in the joined data.
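The merging step in create_alc.R can be sketched as follows. The choice of identifier columns below is an assumption based on the course materials, and combining the duplicated answer columns is omitted from the sketch.

```r
# sketch of the merge in create_alc.R
library(dplyr)

# read the two datasets (semicolon-separated in the UCI archive)
math <- read.table("data/student-mat.csv", sep = ";", header = TRUE)
por  <- read.table("data/student-por.csv", sep = ";", header = TRUE)

# keep only students present in both datasets by joining on
# background columns that identify a student (assumed identifier set)
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))

# the duplicated answer columns (e.g. failures, absences, grades) are then
# combined into single columns -- that step is omitted in this sketch
```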

These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. An important note: the target attribute G3 has a strong correlation with attributes G2 and G1, because G3 is the final-year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. Predicting G3 without G2 and G1 is more difficult, but such a prediction is much more useful (see the source paper for details).

First, we can read and glimpse at the data from alc_use.csv:

# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read data
alc <- read.table("data/alc_use.csv", header = T, sep = ",")
# glimpse at the data
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

Analysis

The names of variables in the dataset are:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Two new columns have been added to the joined data: alc_use is the average of the answers related to weekday and weekend alcohol consumption, and high_use is TRUE for students whose alc_use is greater than 2 (FALSE otherwise).
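Assuming dplyr is loaded, the two columns can be created along these lines:

```r
# average weekday (Dalc) and weekend (Walc) alcohol consumption,
# and flag averages greater than 2 as high use
alc <- mutate(alc,
              alc_use = (Dalc + Walc) / 2,
              high_use = alc_use > 2)
```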


The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. For example, we can create box plots of the final grades (G3) in terms of high use:

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")

Grades of male high alcohol users are lower on average, but this is not the case for females. Because of the high variance in both groups, the grade itself is not a good predictor of alcohol use.

Here, I chose the 4 variables in the data that appear to correlate most strongly with alcohol use: sex, absences, failures, and goout (frequency of going out with friends). The number of absences can be examined with a boxplot:

# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ylab("absences")

High alcohol use correlates positively with the number of absences.

Similar results can be seen with the number of class failures and the frequency of going out with friends. We can examine the effects of goout and failures by creating separate bar plots for low and high use.

# initialize a bar plot of high_use and goout
g1 <- ggplot(alc, aes(x = goout, fill=sex)) + geom_bar()
g1 + facet_wrap("high_use")

g2 <- ggplot(alc, aes(x = failures, fill=sex)) + geom_bar()
g2 + facet_wrap("high_use")

For failures the correlation is not as noticeable, so we can also use group_by() to calculate the average numbers of class failures and absences by sex and usage group.
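The summary table below was produced with a hidden code chunk; assuming dplyr is loaded as above, it can be reproduced along these lines:

```r
# mean failures and absences by sex and alcohol use
alc %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), fail = mean(failures), absent = mean(absences))
```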

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 5
## # Groups:   sex [2]
##   sex   high_use count  fail absent
##   <chr> <lgl>    <int> <dbl>  <dbl>
## 1 F     FALSE      154 0.104   4.25
## 2 F     TRUE        41 0.293   6.85
## 3 M     FALSE      105 0.143   2.91
## 4 M     TRUE        70 0.386   6.1

Next, we use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable as the target variable.
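The model behind the summary below is fitted with glm(); the formula matches the Call shown in the output.

```r
# fit a logistic regression model with high_use as the binary target
m <- glm(high_use ~ sex + absences + failures + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
```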

## 
## Call:
## glm(formula = high_use ~ sex + absences + failures + goout, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## absences     0.08246    0.02266   3.639 0.000274 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4

The variable goout has the lowest p-value, meaning it has the most explanatory power in the model. sex and absences are also highly significant, while “failures” has a fairly low significance (a higher p-value).

The computational target variable in the logistic regression model is the log of odds: \[\log\left( \frac{p}{1 - p} \right).\] Therefore we apply the exponential function to the coefficients to obtain the modeled odds ratios:

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01586098 -5.12422446 -3.2428932
## sexM        2.66415000  0.47316089  1.5009085
## absences    1.08595465  0.03922013  0.1293995
## failures    1.63121360  0.04208372  0.9518235
## goout       2.00975350  0.46657743  0.9418090

Note that in the table above OR has been exponentiated while the confidence intervals returned by confint() are still on the log-odds scale; applying exp() to CI as well would put the columns on the same scale. We can add prediction probabilities and predicted values (TRUE or FALSE) to the table, and perform a 2x2 cross tabulation of predictions versus the actual values.

# add prediction probabilities and predicted values (TRUE or FALSE)
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   16
##    TRUE     61   50

In the table, inaccurately classified individuals are found in the off-diagonal values.

# table of proportional values
t <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
t
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65675676 0.04324324 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.82162162 0.17837838 1.00000000
# total fail rate
(t[1,2] + t[2,1])
## [1] 0.2081081

The training error (i.e. the percentage of wrong predictions) is 20.8%. The prediction accuracy is not too bad.

However, we can simplify the model by dropping both failures and absences:

m <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
# Summarise the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.8193     0.4572  -8.354  < 2e-16 ***
## sexM          0.9268     0.2507   3.696 0.000219 ***
## goout         0.7578     0.1190   6.369  1.9e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.94  on 367  degrees of freedom
## AIC: 394.94
## 
## Number of Fisher Scoring iterations: 4
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

# table of proportional values
t <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
# total fail rate
(t[1,2] + t[2,1])
## [1] 0.2135135

The failure rate is very similar at 21.4%, so clearly not all variables are needed to get good predictions.


Finally, let’s perform 10-fold cross-validation on the simplified model:

# loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2243243

The error rate varies between runs, but on average it is around 21%. This model is clearly better than the one introduced in Exercise Set 3.


Chapter 4

Exercise Set 4.

Analysis

First we load the Boston data from the MASS package.
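The loading step is hidden in the knitted output; it amounts to:

```r
# access the MASS package and the Boston data
library(MASS)
data("Boston")
```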

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

The Boston dataset consists of information collected by the U.S. Census Service concerning housing in the Boston suburbs, such as the per capita crime rate (crim). The data frame has 506 rows and 14 columns.

# print variable names and dimensions of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

We can visualize the relationships between variables by calculating and plotting a correlation matrix:

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 
round(cor_matrix, digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

For example, the crime rate (crim) correlates most strongly with accessibility to radial highways (rad) and the property-tax rate (tax).

Summarising the data variables:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

For each variable, summary() prints out the mean, median, minimum and maximum values, and the 1st and 3rd quartiles.


Next we want to standardize the dataset so that the mean is 0 and the standard deviation is 1 for each variable.

# center and standardize variables
boston_scaled <- as.data.frame(scale(Boston))
# make crime rate variable numeric
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We can create a categorical variable of the crime rate by cutting the scaled crime rate at its quartiles.

# divide data to bins
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# print table of the new variable
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Finally, we divide the dataset into train and test sets, with 80% of the data in the train set:

# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)

# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear Discriminant Analysis

We use linear discriminant analysis on the train set with crime rate as the target variable and all the other variables as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2648515 0.2277228 0.2574257 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       1.0238605 -0.9793461 -0.11640431 -0.9088110  0.43724680 -0.9170710
## med_low  -0.1097611 -0.2709396 -0.05155709 -0.5520220 -0.21131780 -0.3077811
## med_high -0.3837758  0.2204956  0.07002747  0.4516621  0.08314622  0.3964594
## high     -0.4872402  1.0170690 -0.04518867  1.0572915 -0.41341281  0.8051324
##                 dis        rad        tax     ptratio       black        lstat
## low       0.9723729 -0.6725811 -0.7196436 -0.41987151  0.37766136 -0.789035129
## med_low   0.3382136 -0.5546844 -0.5055807 -0.09083605  0.34788771 -0.074058686
## med_high -0.3986553 -0.4126310 -0.2766658 -0.22899005  0.09008068  0.002386964
## high     -0.8561968  1.6386213  1.5144083  0.78135074 -0.70489361  0.910984773
##                 medv
## low       0.51187718
## med_low  -0.03374719
## med_high  0.16212785
## high     -0.70226104
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07386515  0.58119845 -0.80823690
## indus    0.05826081 -0.42553489  0.49823325
## chas    -0.11224342  0.08706484  0.13999523
## nox      0.41784425 -0.81308684 -1.48982358
## rm      -0.10650187 -0.05230303 -0.22951413
## age      0.18738135 -0.19338831 -0.03984347
## dis     -0.04819067 -0.14319286  0.06165281
## rad      3.13232869  0.96769635  0.36593275
## tax     -0.02036994  0.16228997  0.01855767
## ptratio  0.12982447 -0.07768621 -0.36870098
## black   -0.14690853  0.05621491  0.22076989
## lstat    0.20684628 -0.16516085  0.31597687
## medv     0.19459127 -0.37616148 -0.30356069
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9421 0.0418 0.0161
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)


We can test the accuracy of the LDA model by predicting on the test data and cross-tabulating the results.

# separate the correct classes from test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       8        4    0
##   med_low    6      11        2    0
##   med_high   0      15       18    1
##   high       0       0        0   23

66 / 102 = 64.7% of classes were correctly predicted. High and low crime rate predictions are mostly correct, while the middle crime rate prediction accuracy is quite low.


K-Means Clustering

Similarity between observations can be measured by calculating the Euclidean distances between them:

# euclidean distance matrix
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047

K-means clustering divides the dataset into k clusters based on their distances.

k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The optimal number of clusters is found where the total within-cluster sum of squares (WCSS) drops radically. In this case the optimal k is 2. We then plot part of the dataset with the clusters separated by colors:

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot part of the dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)


Bonus

K-means clustering with k=3:

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)

km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[c("rm", "age", "dis", "crim")], col = km$cluster)

Perform LDA using the clusters as target classes:

train$crime <- as.numeric(train$crime)
km <- kmeans(train, centers = 3)
# linear discriminant analysis
lda.fit <- lda(km$cluster ~ ., data = train)
lda.fit
## Call:
## lda(km$cluster ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3836634 0.2772277 0.3391089 
## 
## Group means:
##           zn      indus       chas        nox         rm        age        dis
## 1  0.7431953 -0.9170888 -0.1961271 -0.9005622  0.4694583 -1.0050277  0.9499061
## 2 -0.4872402  1.0575658 -0.0262603  1.0019150 -0.4042927  0.7604060 -0.8346874
## 3 -0.4010165  0.1595450  0.1300023  0.2045813 -0.3013126  0.4763877 -0.3289978
##          rad        tax    ptratio      black      lstat       medv    crime
## 1 -0.6017657 -0.7065261 -0.4145458  0.3648854 -0.7741460  0.5665283 1.458065
## 2  1.5775695  1.5244331  0.8086653 -0.6500602  0.8911452 -0.6951667 3.892857
## 3 -0.5711056 -0.4084738 -0.1332044  0.1941297  0.2009457 -0.1458682 2.518248
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## zn       0.154085551  0.06103328
## indus    0.391385720  0.15527376
## chas     0.080419215  0.24912662
## nox     -0.230489323 -0.18249444
## rm      -0.061577202 -0.30648071
## age      0.243440665  0.90671299
## dis     -0.272545600 -0.43557841
## rad      3.025058848 -1.84037354
## tax      1.202881792 -0.13433918
## ptratio  0.242578180  0.00851564
## black   -0.004462324  0.04236079
## lstat    0.313414629 -0.20396002
## medv     0.072512784 -0.36100594
## crime   -0.070030332  1.13766168
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9163 0.0837
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 1)

The most influential linear separators for the clusters are tax and rad.


Data Wrangling (for next week)

Added create_human.R to the data folder; the script downloads the “Human development” and “Gender inequality” datasets, joins them, and saves the result to human.csv.



Chapter 5


Exercise Set 5.

Data Wrangling (continued)

Further data manipulation is done in create_human.R, and the resulting dataset human.csv is saved to the data folder. There are now 155 observations and 9 variables. We excluded unneeded variables and then filtered out rows with missing (NA) values or rows that relate to regions instead of countries.


Analysis

We open the human data and move the country names to row names:

# access libraries
library(tidyr); library(dplyr); library(ggplot2); library(readr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read data from human.csv
human <- read_csv('data/human.csv', show_col_types = FALSE)

library(tibble)
human <- column_to_rownames(human, "Country")

We can summarise the data with ggplot and corrplot:

library(GGally);library(corrplot)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## corrplot 0.92 loaded
# visualize the variables
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

We can deduce that life expectancy at birth Life.Exp has a strong positive correlation with expected years of schooling Edu.Exp. On the other hand, maternal mortality ratio Mat.Mor and adolescent birth rate Ado.Birth have strong negative correlations with Life.Exp and Edu.Exp. Gross national income per capita GNI shows similar, though weaker, correlations. The percentage of female representatives in parliament Parli.F and the female-to-male labour force participation ratio Labo.FM correlate with each other but only weakly with the other variables.


Principal component analysis (PCA)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
s <- summary(pca_human)

# draw a biplot
biplot(pca_human, choices = 1:2, cex = c(1, 0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PC1 and PC2 are the two most significant axes found by PCA. The percentages of variance they capture are shown in brackets. Pink arrows represent the original variables. Unfortunately, the plot is useless, because the GNI values are much larger than the other variables, causing PCA to overvalue GNI's impact.
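The effect is easy to reproduce with made-up data: an unscaled column with a much larger variance soaks up essentially all of PC1.

```r
# toy data: one small-scale and one large-scale independent variable
set.seed(1)
d <- data.frame(small = rnorm(100), big = rnorm(100, sd = 1000))
unscaled <- summary(prcomp(d))$importance[2, ]                 # PC1 ~ 100%
scaled   <- summary(prcomp(d, scale. = TRUE))$importance[2, ]  # ~ 50/50 split
round(unscaled, 3); round(scaled, 3)
```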


We standardize the data and perform PCA again

# Standardize data, and perform PCA
pca_human <- prcomp(scale(human))
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The lengths of the arrows are now comparable because the standardized variables have equal standard deviations. The labels are not very clear, but Parli.F and Labo.FM dominate the PC2 axis while the other variables dominate PC1, with the directions of the arrows corresponding to positive and negative correlations.

We can deduce that, not very surprisingly, people in more developed countries are better educated, have a higher life expectancy, and have better healthcare outcomes (lower maternal mortality). Additionally, countries where women make up a larger share of the workforce also have more female representatives in parliament, but this does not necessarily indicate higher development.


Multiple Correspondence Analysis (MCA)

Next, we want to analyze the tea data from the FactoMineR package.

# read tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

There are 300 observations and 36 variables. We pick 6 of the variables for analysis.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, any_of(keep_columns))

# a quick look at the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
# visualizing the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

We perform Multiple Correspondence Analysis on the dataset:

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Like with PCA, we can draw a biplot of the results.

# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

We can see some associations in how people consume tea; for example, tea bought at a tea shop tends to come unpackaged. However, Dim 1 and Dim 2 cover a fairly small percentage of the variance, so they do not tell much about the associations. There is some separation between tea types and additives, but not enough to draw conclusions.



Chapter 6


Exercise Set 6.

Data Wrangling

We prepare the BPRS and RATS datasets in meet_and_repeat.R and save them in the data folder.

BPRS contains data on 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS is used to evaluate patients suspected of having schizophrenia.

The RATS data come from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded over a 9-week period (twice in week 7). The question of most interest is whether the growth profiles of the three groups differ.

# run meet_and_repeat.R to initialize both datasets
source("data/meet_and_repeat.R", local = knitr::knit_global())
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
## Rows: 16
## Columns: 13
## $ ID    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

RATS Analysis

First, let’s plot the rat weight trendlines for each group:

# access libraries
library(tidyr); library(dplyr); library(ggplot2); library(readr);

# Plotting data
ggplot(RATS, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "none")

The initial weights differ between groups, so we should also look at standardized values.

# mutate to add stdweights column 
RATS <- RATS %>%
  group_by(Time) %>%
  mutate(StdW = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()

ggplot(RATS, aes(x = Time, y = StdW, linetype = ID)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (standardized)") +
  theme(legend.position = "none")

Rats with a larger initial weight tend to gain weight faster; after standardization this trend largely disappears.

We can also create a summary graph for each group by adding mean and standard error columns to the dataset and plotting the trends with error bars:

# summary dataset
RATSS <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(length(Weight)) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = 'right') +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Repeating with standardized weights:

# summary dataset
RATSS <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(StdW), se = sd(StdW)/sqrt(length(StdW)) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = 'right') +
  scale_y_continuous(name = "mean(StdW) +/- se(StdW)")

Boxplots of mean weights for each group:

RATSS <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSS, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

The boxplots show no massive outliers (although one rat in group 2 could be considered one).

Next, we apply a t-test to assess the difference between groups and calculate a confidence interval for that difference. Since there are three groups and the t-test compares two, I performed it on groups 2 and 3.

RATS2 <- RATS %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Perform a two-sample t-test
t.test(mean ~ Group, data = filter(RATS2, RATS2$Group != 1), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -1.0686, df = 6, p-value = 0.3263
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -130.60428   51.20428
## sample estimates:
## mean in group 2 mean in group 3 
##           487.8           527.5

There is no significant evidence of a difference between the groups. The 95% confidence interval is also wide and includes zero, supporting the same conclusion.
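The two readings agree by construction: for a two-sided t-test at the 5% level, the 95% confidence interval contains zero exactly when p > 0.05. A quick check on made-up data:

```r
# duality of the two-sided t-test and its confidence interval
set.seed(42)
a <- rnorm(8, mean = 0)
b <- rnorm(8, mean = 0.3)
tt <- t.test(a, b, var.equal = TRUE)
ci_contains_zero <- tt$conf.int[1] < 0 && tt$conf.int[2] > 0
ci_contains_zero == (tt$p.value > 0.05)  # TRUE
```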

We can also fit a linear model and compute the analysis of variance (ANOVA) for the model.

# Add the baseline from the original data as a new variable to the summary data
RATS1 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATS2 <- RATS2 %>% mutate(baseline = RATS1$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATS2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the group effect (0.076) is not significant, while the baseline weight is highly significant.


BPRS Analysis

Plotting BPRS per treatment group:

ggplot(BPRS, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))

First we can fit a multiple linear regression model:

# create a regression model
fit <- lm(bprs ~ week + treatment, data = BPRS)
# print out a summary of the model
summary(fit)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The change in BPRS over time (week) is significant, but the difference between treatments is not.
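To read the fitted model: treatment group 1 is the baseline, so its predicted BPRS at a given week comes straight from the intercept and the week coefficient, and treatment2 merely shifts that line. For example, at week 4:

```r
# predicted BPRS from the printed coefficients (treatment group 1, week 4)
intercept <- 46.4539
beta_week <- -2.2704
intercept + beta_week * 4   # about 37.4
# for treatment group 2, add the treatment2 coefficient: + 0.5722
```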

The previous model assumes independence of the repeated BPRS measurements, and this assumption is highly unlikely to hold. So, we now move on to consider both more appropriate graphics and more appropriate models. We first fit a random intercept model for the same two explanatory variables using the lme4 package. Fitting a random intercept model allows the linear regression fit for each subject to differ in intercept from the other subjects.

# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
fit <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
# Print the summary of the model
summary(fit)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Now we can move on to fit the random intercept and random slope model (week | subject) to the data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept as well as in slope. This makes it possible to account for individual differences in the subjects’ profiles as well as the effect of time. We also allow for a week × treatment interaction.

fit <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = BPRS, REML = FALSE)
summary(fit)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
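The two mixed models can be compared with a likelihood ratio test (with lme4 this would be anova() on the two fitted objects, which here were both stored as fit). The statistic can also be computed by hand from the printed deviances (2738.7 vs 2728.3) and the difference in residual degrees of freedom (355 − 352 = 3):

```r
# likelihood ratio test from the deviances printed above
lrt <- 2738.7 - 2728.3                            # chi-squared statistic, 10.4
p <- pchisq(lrt, df = 355 - 352, lower.tail = FALSE)
p   # between 0.01 and 0.05: the random slope model fits better
```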

Plotting the fitted values, we can see that the model did not find a clear difference between the two treatment groups:

# Create a vector of the fitted values
Fitted <- fitted(fit)

# Create a new column fitted
BPRS <- BPRS %>% mutate(Fitted = Fitted)

ggplot(BPRS, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))